Zurich - 27 & 28 June 2022

Missing data mechanisms (recap)

Assuming the missing data mechanism (MCAR)

The methods to deal with missing data implicitly assume a missing data mechanism.

MCAR: the strictest assumption. In practice, MCAR data are also the easiest to deal with.

  1. Analyze the observed sample only (this will result in unbiased estimates).
  2. Use an imputation method to boost power when the amount of missing data is too large.

Assuming the missing data mechanism (MAR)

MAR: less strict assumption. Most advanced missing data methods assume this mechanism (e.g. multiple imputation, FIML).

  • By including variables in the study that may explain the missing data, the MAR assumption may become more plausible (as compared to MNAR).
  • These auxiliary variables may also help in dealing with the missing data.
  • Auxiliary variables: variables related to the probability of missing data or to the variable with missing data.
    • Can be used as predictors in an imputation model or as covariates in a FIML model to improve estimations.

Assuming the missing data mechanism (MNAR)

MNAR: least strict assumption.

  • MNAR data are also referred to as non-ignorable, because these cannot be ignored without causing bias in results. MNAR data are more challenging to deal with.

Ad hoc missing data methods

Complete-case analysis

Complete-case analysis (CCA): only the cases with observed data for all variables involved are used in the analysis.

  • The most common and easy way to deal with missing data.
  • The default method for many data analysis procedures.
  • Assumes MCAR

Parameter estimates:

  • Mean is unbiased for MCAR
  • Regression weight/correlation unbiased for MCAR
  • Standard errors are overestimated
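The behavior above can be illustrated with a small simulation in base R (a sketch with made-up data; `lm()` performs complete-case analysis by default):

```r
# Complete-case analysis sketch: simulate MCAR missingness and compare a
# complete-data fit with a complete-case fit (illustrative data only).
set.seed(1)
n <- 150
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)                 # true slope 0.3
y_mcar <- y
y_mcar[rbinom(n, 1, 0.4) == 1] <- NA    # ~40% missing completely at random

fit_full <- lm(y ~ x)
fit_cca  <- lm(y_mcar ~ x)              # lm() drops incomplete cases by default

# The CCA slope remains approximately unbiased under MCAR, but its
# standard error is larger because fewer cases contribute:
summary(fit_full)$coefficients["x", "Std. Error"]
summary(fit_cca)$coefficients["x", "Std. Error"]
```

This mirrors the tables below: similar estimates, but a clearly larger standard error for the incomplete data.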

CCA

MCAR | regression

           estimate  std.error  p.value
Complete       0.22       0.09     0.02
MCAR           0.24       0.17     0.16

CCA

MAR | regression

           estimate  std.error  p.value
Complete       0.22       0.09     0.02
MAR            0.09       0.15     0.53

Imputation

Imputation: replacing each missing entry with a value.

Possible values to impute?

  • Mean
  • Regression estimate
  • Random value
  • Previous observation (longitudinal study)
  • Informed guess / estimate

Assumption of single imputation

  • Most important: assumes that the imputed value is an actual observed value.
  • Some assume MCAR (e.g. regression imputation, random value imputation)

Other examples:

  • All missing values are average (i.e. mean imputation)
  • People with missing data remain stable (i.e. LOCF)

Example data

                Weight                  Performance
Gender    n     n     mean      sd      n     mean     sd
Man      81    73     86.009   9.660   73     7.998   3.539
Woman    61    50     75.782   8.478   52     9.782   3.238

Mean imputation

  • Imputing the average value for all missing entries
  • Very easy method

Parameter estimates:

  • Mean is unbiased when MCAR
  • Regression weight/correlation never unbiased
  • Standard errors are underestimated
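A minimal sketch of mean imputation and its main side effect (illustrative data):

```r
# Mean imputation sketch: fill every missing value with the observed mean
# (easy, but not recommended in practice).
set.seed(2)
y <- rnorm(100, mean = 50, sd = 10)
y_obs <- y
y_obs[sample(100, 30)] <- NA

y_imp <- ifelse(is.na(y_obs), mean(y_obs, na.rm = TRUE), y_obs)

# The mean is preserved, but the variance shrinks because all imputed
# values sit exactly at the mean, so standard errors are underestimated:
mean(y_imp)
sd(y_imp)
sd(y_obs, na.rm = TRUE)
```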

Mean imputation

Example

Regression imputation

Conditional mean imputation: estimate the imputed value as a predicted value from a regression model: \(Y_{imp} = \beta_0 + \beta_1 * X\)

Parameter estimates:

  • Mean is unbiased when MAR
  • Regression weight is unbiased when MAR
  • Correlation is never unbiased
  • Standard errors are underestimated

Regression imputation

Example

Imputation from regression equation: \(Performance = \beta_0 + \beta_1 * weight + \beta_2 * gender\)
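The regression imputation step can be sketched in base R (illustrative data; one predictor instead of weight and gender, for brevity):

```r
# Regression imputation sketch: impute performance from weight via the
# predicted value of a regression fitted on the observed cases.
set.seed(3)
weight <- rnorm(80, 80, 10)
performance <- 5 + 0.05 * weight + rnorm(80, sd = 2)
performance[sample(80, 20)] <- NA

fit  <- lm(performance ~ weight)           # fitted on observed cases only
pred <- predict(fit, newdata = data.frame(weight = weight))
performance_imp <- ifelse(is.na(performance), pred, performance)
```

Note that every imputed value lies exactly on the regression line, which is why correlations are overestimated and standard errors underestimated.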

Regression imputation + sampling error

Stochastic regression imputation: regression imputation, with additional sampling error added to the predicted value:

  • \(Y_{imp} = \beta_0 + \beta_1 * X + \epsilon\)

Sampling error is normally distributed.

Parameter estimates:

  • Mean is unbiased when MAR
  • Regression weight is unbiased when MAR
  • Correlation is unbiased when MAR
  • Standard errors are underestimated

Imputation uncertainty is not taken into account

Regression imputation + residual error

Example

Imputation from regression equation: \(Performance = \beta_0 + \beta_1 * weight + \beta_2 * gender + \epsilon\)
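A base-R sketch of the stochastic variant (illustrative data; one predictor for brevity): the only change relative to plain regression imputation is the added normally distributed residual.

```r
# Stochastic regression imputation sketch: predicted value plus a draw
# from the residual distribution.
set.seed(4)
weight <- rnorm(80, 80, 10)
performance <- 5 + 0.05 * weight + rnorm(80, sd = 2)
miss <- sample(80, 20)
performance[miss] <- NA

fit  <- lm(performance ~ weight)
pred <- predict(fit, newdata = data.frame(weight = weight))
sigma_hat <- summary(fit)$sigma            # residual standard deviation

performance_imp <- performance
performance_imp[miss] <- pred[miss] + rnorm(length(miss), sd = sigma_hat)
```

The imputed values now scatter around the regression line instead of lying on it, which preserves the correlation structure; the remaining problem is that the uncertainty about the regression parameters themselves is still ignored.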

Last observation carried forward

Ad hoc method for longitudinal data: use the previous observed value to impute the missing values.

  • Assumes that people who drop out of the study remain stable.

Parameter estimates:

  • Mean is biased
  • Regression is biased
  • Correlation is biased
  • Standard errors are underestimated
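LOCF is easy to write down in base R (a sketch; packages such as zoo provide equivalent functions):

```r
# LOCF sketch: carry the last observed value forward within one
# participant's measurement vector.
locf <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

# Hypothetical weights at baseline, 3m, 6m, 12m, 24m; drop-out after 6 months:
locf(c(92, 88, 85, NA, NA))   # -> 92 88 85 85 85
```

In a weight loss trial this imputes a flat trajectory after drop-out, which is exactly the "remain stable" assumption and the source of the bias listed above.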

Example longitudinal data

  • A RCT for a weight loss intervention.
  • Measurements at baseline, mid-treatment (3m), post-treatment (6m), and two follow-up moments (12m and 24m).
  • Missing data occurs at the measurements after the baseline, due to study drop-out.

>      id group  0  3  6 12 24   
> 76    1     1  1  1  1  1  1  0
> 16    1     1  1  1  1  1  0  1
>  9    1     1  1  1  1  0  0  2
>  7    1     1  1  1  0  0  0  3
>  2    1     1  1  0  0  0  0  4
>       0     0  0  2  9 18 34 63

LOCF

Example imputed

The trajectories over time, with LOCF imputations in red.

LOCF

Example imputed

Below only the cases with missing observations, separated by group.

Multiple imputation

Multiple imputation idea

  • Imputing the missing data entries with multiple “plausible” values.
  • Takes imputation uncertainty into account
  • A method to improve the main analysis results, so having a fully observed data set is not the goal.
    • Unbiased estimates
    • More precision and thus power
  • R package: mice

Multiple Imputation process

  • Imputation phase
  • Analysis phase
  • Pooling phase
  1. Incomplete data
  2. Generate multiple copies of the same dataset, but each time with differently imputed values
  3. Analyze each imputed dataset
  4. Pool the results of the analyses into the final study result

Multiple Imputation process


Imputation phase

  • Imputing the missing data entries with multiple “plausible” values
  • Specify the imputation model
    • method to estimate imputed values
    • predictors used in the model
    • can both be specified for each variable with missing data

Algorithm for imputations

Fully Conditional Specification (FCS): imputes variable-by-variable.

  • For each variable the imputation model can be different:
    • method
    • predictors
  • Flexible method and allows imputations close to the data

Joint modelling: imputes from a multivariate distribution.

  • Computationally less intensive.
  • Good theoretical underpinnings, provided that the model fits the data well.

mice uses FCS

General method FCS

Per variable with missing data:

  • Impute missing value by using the predicted value estimated from observed data.
  • Add residual error to simulate sampling error
  • The imputation method is chosen depending on the type of variable that needs to be imputed.

Later more on methods

Predictors for imputation

Imputation model = method + predictors

  • For each variable, the most relevant predictors can be indicated
  • Use variables that are in the analysis model also in the imputation model (compatibility)
  • Add auxiliary variables
  • It is advised to use no more than 25 predictors (this also depends on sample size)

Auxiliary variables: variables related to the probability of missing data or to the variable with missing data.

Iterations

  • One iteration = one cycle through all variables that need to be imputed.
    • The algorithm starts with a random draw from the observed data.
    • Then imputes the incomplete data in a variable-by-variable fashion.
  • Typically 5 to 10 iterations before an imputed data set is generated.
  • Iterations are repeated to ensure convergence of the algorithm

Iterations for data set 1

Iterations for generating the imputations for Solar.R in the first data set:

Iterations for data set 2

Iterations for generating the imputations for Solar.R in the second data set:

Iterations for 5 imputed datasets

Iterations for generating the imputations for Solar.R in all five data sets:

Convergence check

airquality illustration

  • Check the convergence of the algorithm for all imputed variables

Convergence check

airquality illustration

  • Use additional iterations when in doubt if convergence is reached

Convergence check

boys illustration

  • Convergence may need more iterations

Convergence check

boys illustration

  • Looks better for gen and tv, but there seems to be a problem for wgt.

Adapting imputation model

  • Changes in imputation model may be needed when:
    • Two variables are highly correlated (multi-collinearity)
    • A variable has so many missing values that the imputations are unstable

Number of imputations

  • Theoretically, higher \(m\) is better.
  • Usually, use \(m = 5\) for model building and \(m\) ~ % missing data for final analysis.
  • For lower \(m\), the between-imputation variance is enlarged by the factor \(1 + \frac{1}{m}\) when computing the total variance.

Analysis phase

  • Each imputed data set is analyzed, with the substantive analysis model

  • This results in \(m\) sets of results

  • Workflow:

    • Imputation
    • Analysis
    • Pooling

Analysis example

> # A tibble: 10 x 6
>    term        estimate std.error statistic   p.value  nobs
>    <chr>          <dbl>     <dbl>     <dbl>     <dbl> <int>
>  1 (Intercept)  24.8       5.71        4.34 0.0000258   153
>  2 Solar.R       0.0996    0.0278      3.58 0.000455    153
>  3 (Intercept)  21.7       5.62        3.85 0.000173    153
>  4 Solar.R       0.105     0.0271      3.88 0.000153    153
>  5 (Intercept)  21.3       5.53        3.86 0.000165    153
>  6 Solar.R       0.103     0.0267      3.87 0.000164    153
>  7 (Intercept)  21.8       5.55        3.93 0.000127    153
>  8 Solar.R       0.109     0.0271      4.02 0.0000930   153
>  9 (Intercept)  20.0       5.55        3.60 0.000433    153
> 10 Solar.R       0.112     0.0271      4.14 0.0000578   153

Pooling phase

  • Pool the analysis results to obtain final parameter estimates.
  • For normally distributed parameters: Rubin’s Rules

Rubin’s Rules

  • Pooling of point estimates that are normally distributed over the imputed datasets.

  • Means, standard deviations, regression estimates, linear predictors, proportions.

  • For pooling point estimates, use mean:

    \(\hat\theta = \frac{1}{m} \sum^m_{i=1}{\theta_i}\)

  • Pooling of variance or standard error around the estimate: combine the within and between imputation variance.

Variance pooling with Rubin’s Rules

Between variance:

\(\sigma^2_{between} = \frac{\sum^m_{i=1}(\beta_i - \overline\beta)^2}{m-1}\)


Within variance:

\(\sigma^2_{within} = \frac{\sum^m_{i=1}\sigma^2_i}{m}\)


Total variance:

\(\sigma^2_{total} = \sigma^2_{within} + \sigma^2_{between} + \frac{\sigma^2_{between}}{m}\)
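The three formulas above can be applied by hand to the five Solar.R rows from the analysis example (a base-R sketch; the numbers are the rounded estimates and standard errors shown earlier):

```r
# Rubin's rules sketch: pool m = 5 regression estimates and their variances.
est <- c(0.0996, 0.105, 0.103, 0.109, 0.112)   # Solar.R estimates per dataset
se  <- c(0.0278, 0.0271, 0.0267, 0.0271, 0.0271)
m   <- length(est)

pooled  <- mean(est)                         # pooled point estimate
within  <- mean(se^2)                        # sigma^2_within
between <- sum((est - pooled)^2) / (m - 1)   # sigma^2_between
total   <- within + between + between / m    # sigma^2_total
pooled_se <- sqrt(total)
```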

Pooling of non-normal parameters

  • Use a transformation to approximate a normal distribution
  • Examples:
    • Correlation: Fisher \(z\)
    • Odds Ratio, Relative Risk, Hazard ratio: log transformation
    • Explained variance: Fisher \(z\) for \(\sqrt{R^2}\)
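For a correlation, the transform-pool-backtransform recipe is short (a sketch with hypothetical correlations):

```r
# Pooling correlations via Fisher's z: transform, average, back-transform.
r <- c(0.42, 0.47, 0.45, 0.40, 0.44)   # hypothetical correlations from m = 5 datasets
z <- atanh(r)                          # Fisher z: 0.5 * log((1 + r) / (1 - r))
pooled_r <- tanh(mean(z))              # pool on the z scale, then back-transform
```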

Pooling example

> Class: mipo    m = 5 
>          term m   estimate         ubar            b            t dfcom
> 1 (Intercept) 5 21.9250591 3.128523e+01 3.114620e+00 3.502277e+01   151
> 2     Solar.R 5  0.1057732 7.371338e-04 2.371589e-05 7.655929e-04   151
>          df        riv     lambda       fmi
> 1  96.54006 0.11946675 0.10671755 0.1246657
> 2 136.72129 0.03860773 0.03717258 0.0509547

Pool output

  • riv: Relative increase in variance due to nonresponse
  • df: Residual degrees of freedom for hypothesis testing
  • lambda: Proportion of total variance due to missingness
  • fmi: Fraction of missing information for the parameter estimates

Imputation methods

Multiple imputation process


In the imputation phase, imputed values are estimated using an imputation method.

Methods for continuous variables

  • Regression imputation: norm.predict
  • Stochastic regression imputation (regression + residual error): norm.nob
  • Stochastic regression with parameter uncertainty: norm
  • Predictive mean matching: pmm

Example data

airquality

          vars   n     mean      sd  median  trimmed     mad   min    max  range    skew  kurtosis     se
Ozone        1 116   42.129  32.988    31.5   37.798  25.946   1.0  168.0    167   1.210     1.112  3.063
Solar.R      2 146  185.932  90.058   205.0  190.339  98.593   7.0  334.0    327  -0.419    -1.004  7.453
Wind         3 153    9.958   3.523     9.7    9.870   3.410   1.7   20.7     19   0.341     0.029  0.285
Temp         4 153   77.882   9.465    79.0   78.285   8.896  56.0   97.0     41  -0.371    -0.463  0.765
Month        5 153    6.993   1.417     7.0    6.992   1.483   5.0    9.0      4  -0.002    -1.317  0.115
Day          6 153   15.804   8.865    16.0   15.805  11.861   1.0   31.0     30   0.003    -1.222  0.717

Regression imputation

Algorithm

  • Imputed value: \(Y_{imp} = \hat{\beta}_0 + X_{mis}\hat{\beta}_1\)

  • Parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are estimated from the observed data.

  • And updated after the first imputation with the imputed data.

Regression imputation

Concept

Regression imputation

Univariate imputation

Regression imputation

Multivariate imputation

Stochastic regression imputation

Algorithm

  • Imputed value: \(Y_{imp} = \hat{\beta}_0 + X_{mis}\hat{\beta}_1 + \epsilon\)

  • Parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are estimated from the observed data.

  • And updated after the first imputation with the imputed data.

  • \(\epsilon\) is normally distributed residual error

Stochastic regression imputation

Concept

Stochastic regression imputation

Univariate

Stochastic regression imputation

Multivariate

Bayesian regression imputation

Algorithm

  • Imputed value: \(Y_{imp} = \dot{\beta}_0 + X_{mis}\dot{\beta}_1 + \epsilon\)

  • Parameters \(\dot{\beta}_0\) and \(\dot{\beta}_1\) are drawn from their posterior distribution.

  • \(\epsilon\) is normally distributed residual error
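A rough base-R sketch of drawing the parameters before predicting (an approximation: the coefficients are drawn from a normal approximation to their posterior, and the draw of the residual variance is omitted for brevity):

```r
# Bayesian regression imputation sketch: each imputation uses its own
# parameter draw, so parameter uncertainty propagates into the imputations.
set.seed(7)
x <- rnorm(80)
y <- 1 + 0.5 * x + rnorm(80)
miss <- sample(80, 20)
y[miss] <- NA

fit <- lm(y ~ x)
beta_hat  <- coef(fit)
V         <- vcov(fit)
sigma_hat <- summary(fit)$sigma

# One draw from the (approximate) posterior of (beta0, beta1):
R <- chol(V)                               # V = t(R) %*% R
beta_dot <- beta_hat + drop(t(R) %*% rnorm(2))

y_imp <- y
y_imp[miss] <- beta_dot[1] + beta_dot[2] * x[miss] +
  rnorm(length(miss), sd = sigma_hat)
```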

Bayesian regression imputation

Concept

Bayesian regression imputation

Univariate

Bayesian regression imputation

Multivariate

Predictive mean matching

Algorithm

  • Estimate the predicted value using Bayesian regression imputation
    • \(Y_{imp} = \dot{\beta}_0 + X_{mis}\dot{\beta}_1 + \epsilon\)
    • Parameters \(\dot{\beta}_0\) and \(\dot{\beta}_1\) are drawn from their posterior distribution.
    • \(\epsilon\) is normally distributed residual error
  • Select \(k\) nearest neighbors to this predicted value from the observed data
  • Randomly draw one donor to use as imputed value
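The matching and donor-draw steps can be sketched in base R (a simplification: the Bayesian parameter draw is omitted here, so only the nearest-neighbor matching is illustrated):

```r
# Simplified predictive mean matching sketch with illustrative data.
set.seed(5)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
miss <- sample(100, 25)
y_obs <- y
y_obs[miss] <- NA

fit  <- lm(y_obs ~ x)
pred <- predict(fit, newdata = data.frame(x = x))  # predictions for all cases

k <- 5                                             # number of donor candidates
y_imp <- y_obs
for (i in miss) {
  d <- abs(pred[-miss] - pred[i])                  # distance to observed cases
  donors <- y_obs[-miss][order(d)[1:k]]            # k nearest observed values
  y_imp[i] <- sample(donors, 1)                    # draw one donor as imputation
}
```

Because each imputed value is a real observed value, PMM cannot produce impossible values (e.g. negative weights), which is one reason it is mice's default for continuous data.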

Predictive mean matching

Concept

Predictive mean matching

Concept

Predictive mean matching

Univariate

Predictive mean matching

Multivariate

Methods for categorical data

  • Bayesian logistic regression: logreg
  • Bayesian polytomous regression: polyreg
  • Classification and regression trees: cart

Bayesian logistic regression

Algorithm

  • Imputed value: \(\log\frac{P(Y_{mis})}{1-P(Y_{mis})} = \dot{\beta}_0 + X_{mis}\dot{\beta}_1 + \epsilon\)

  • Parameters \(\dot{\beta}_0\) and \(\dot{\beta}_1\) are drawn from their posterior distribution.

  • \(\epsilon\) is normally distributed residual error

Bayesian logistic regression

Concept

Bayesian polytomous regression

Algorithm

  • Fit a multinomial regression model.

  • Parameters are drawn from their posterior distribution (Bayesian).

  • Compute the predicted category.

  • Add normally distributed residual error to account for sampling variance.

Bayesian polytomous regression

Concept

Classification and regression trees

Algorithm

For continuous or categorical variables

  • Draw a random bootstrap sample as training set.
  • Fit the tree model using the observed data.
  • Find the predicted terminal node for the missing value
  • Use the observed cases at the predicted terminal node as donors
  • Randomly draw the imputed value from the observed donor cases

Classification and regression trees

Concept

Classification and regression trees

Univariate

Classification and regression trees

Multivariate

Software demo mice

mice

library(mice)
help(mice)

Most important arguments in mice()

  • m: number of multiple imputations
  • method: imputation method
  • predictorMatrix: matrix in which a value of 1 means that the column variable is used as a predictor for the target variable in that row.
  • maxit: number of iterations
  • seed: seed for the random number generator (for reproducible imputations)
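A minimal call combining these arguments (a sketch; the incomplete example data below appear to be mice's built-in nhanes2 dataset, and printFlag = FALSE only suppresses the iteration log):

```r
# Impute, analyze, and pool in three steps with mice.
library(mice)

imp <- mice(nhanes2, m = 5, maxit = 10, seed = 123, printFlag = FALSE)
imp$method           # imputation method chosen per variable
imp$predictorMatrix  # which variables predict which

fit <- with(imp, lm(chl ~ age))  # analyze each imputed dataset
summary(pool(fit))               # pool with Rubin's rules
```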

Summary of data

>     age          bmi          hyp          chl       
>  20-39:12   Min.   :20.40   no  :13   Min.   :113.0  
>  40-59: 7   1st Qu.:22.65   yes : 4   1st Qu.:185.0  
>  60-99: 6   Median :26.75   NA's: 8   Median :187.0  
>             Mean   :26.56             Mean   :191.4  
>             3rd Qu.:28.93             3rd Qu.:212.0  
>             Max.   :35.30             Max.   :284.0  
>             NA's   :9                 NA's   :10

Multiple imputation

> 
>  iter imp variable
>   1   1  bmi  hyp  chl
>   1   2  bmi  hyp  chl
>   1   3  bmi  hyp  chl
>   1   4  bmi  hyp  chl
>   1   5  bmi  hyp  chl
>   2   1  bmi  hyp  chl
>   2   2  bmi  hyp  chl
>   2   3  bmi  hyp  chl
>   2   4  bmi  hyp  chl
>   2   5  bmi  hyp  chl
>   3   1  bmi  hyp  chl
>   3   2  bmi  hyp  chl
>   3   3  bmi  hyp  chl
>   3   4  bmi  hyp  chl
>   3   5  bmi  hyp  chl
>   4   1  bmi  hyp  chl
>   4   2  bmi  hyp  chl
>   4   3  bmi  hyp  chl
>   4   4  bmi  hyp  chl
>   4   5  bmi  hyp  chl
>   5   1  bmi  hyp  chl
>   5   2  bmi  hyp  chl
>   5   3  bmi  hyp  chl
>   5   4  bmi  hyp  chl
>   5   5  bmi  hyp  chl
>   6   1  bmi  hyp  chl
>   6   2  bmi  hyp  chl
>   6   3  bmi  hyp  chl
>   6   4  bmi  hyp  chl
>   6   5  bmi  hyp  chl
>   7   1  bmi  hyp  chl
>   7   2  bmi  hyp  chl
>   7   3  bmi  hyp  chl
>   7   4  bmi  hyp  chl
>   7   5  bmi  hyp  chl
>   8   1  bmi  hyp  chl
>   8   2  bmi  hyp  chl
>   8   3  bmi  hyp  chl
>   8   4  bmi  hyp  chl
>   8   5  bmi  hyp  chl
>   9   1  bmi  hyp  chl
>   9   2  bmi  hyp  chl
>   9   3  bmi  hyp  chl
>   9   4  bmi  hyp  chl
>   9   5  bmi  hyp  chl
>   10   1  bmi  hyp  chl
>   10   2  bmi  hyp  chl
>   10   3  bmi  hyp  chl
>   10   4  bmi  hyp  chl
>   10   5  bmi  hyp  chl

predictor matrix

>     age bmi hyp chl
> age   0   1   1   1
> bmi   1   0   1   1
> hyp   1   1   0   1
> chl   1   1   1   0

predictor matrix

Remove a predictor variable, e.g. remove “bmi” as predictor for “chl”.

>     age bmi hyp chl
> age   0   1   1   1
> bmi   1   0   1   1
> hyp   1   1   0   1
> chl   1   0   1   0

Method per variable

>      age      bmi      hyp      chl 
>       ""    "pmm" "logreg"    "pmm"

Change the imputation method for one variable

>      age      bmi      hyp      chl 
>       ""   "norm" "logreg"    "pmm"

Run with new settings

> 
>  iter imp variable
>   1   1  bmi  hyp  chl
>   1   2  bmi  hyp  chl
>   1   3  bmi  hyp  chl
>   1   4  bmi  hyp  chl
>   1   5  bmi  hyp  chl
>   2   1  bmi  hyp  chl
>   2   2  bmi  hyp  chl
>   2   3  bmi  hyp  chl
>   2   4  bmi  hyp  chl
>   2   5  bmi  hyp  chl
>   3   1  bmi  hyp  chl
>   3   2  bmi  hyp  chl
>   3   3  bmi  hyp  chl
>   3   4  bmi  hyp  chl
>   3   5  bmi  hyp  chl
>   4   1  bmi  hyp  chl
>   4   2  bmi  hyp  chl
>   4   3  bmi  hyp  chl
>   4   4  bmi  hyp  chl
>   4   5  bmi  hyp  chl
>   5   1  bmi  hyp  chl
>   5   2  bmi  hyp  chl
>   5   3  bmi  hyp  chl
>   5   4  bmi  hyp  chl
>   5   5  bmi  hyp  chl
>   6   1  bmi  hyp  chl
>   6   2  bmi  hyp  chl
>   6   3  bmi  hyp  chl
>   6   4  bmi  hyp  chl
>   6   5  bmi  hyp  chl
>   7   1  bmi  hyp  chl
>   7   2  bmi  hyp  chl
>   7   3  bmi  hyp  chl
>   7   4  bmi  hyp  chl
>   7   5  bmi  hyp  chl
>   8   1  bmi  hyp  chl
>   8   2  bmi  hyp  chl
>   8   3  bmi  hyp  chl
>   8   4  bmi  hyp  chl
>   8   5  bmi  hyp  chl
>   9   1  bmi  hyp  chl
>   9   2  bmi  hyp  chl
>   9   3  bmi  hyp  chl
>   9   4  bmi  hyp  chl
>   9   5  bmi  hyp  chl
>   10   1  bmi  hyp  chl
>   10   2  bmi  hyp  chl
>   10   3  bmi  hyp  chl
>   10   4  bmi  hyp  chl
>   10   5  bmi  hyp  chl

Iteration plots

View imputed values

View imputed values linear

Analyze imputed data

View separate analysis results

> # A tibble: 15 x 6
>    term        estimate std.error statistic  p.value  nobs
>    <chr>          <dbl>     <dbl>     <dbl>    <dbl> <int>
>  1 (Intercept)    182.       10.9    16.6   6.09e-14    25
>  2 age40-59        19.1      18.0     1.06  2.99e- 1    25
>  3 age60-99        39.4      18.9     2.08  4.90e- 2    25
>  4 (Intercept)    177.       11.0    16.1   1.14e-13    25
>  5 age40-59        21.3      18.0     1.18  2.51e- 1    25
>  6 age60-99        41.4      19.0     2.18  4.00e- 2    25
>  7 (Intercept)    184.       11.5    16.1   1.21e-13    25
>  8 age40-59        25.8      18.9     1.37  1.85e- 1    25
>  9 age60-99        56.8      19.8     2.86  9.02e- 3    25
> 10 (Intercept)    178.       11.0    16.1   1.11e-13    25
> 11 age40-59        26.2      18.1     1.44  1.63e- 1    25
> 12 age60-99        26.7      19.1     1.40  1.76e- 1    25
> 13 (Intercept)    183.       13.0    14.1   1.70e-12    25
> 14 age40-59        17.5      21.4     0.817 4.22e- 1    25
> 15 age60-99        54.6      22.5     2.43  2.39e- 2    25

Pool results

> Class: mipo    m = 5 
>          term m  estimate     ubar         b        t dfcom       df        riv
> 1 (Intercept) 5 180.61667 132.0368  11.14792 145.4143    22 17.69014 0.10131643
> 2    age40-59 5  21.98333 358.3857  15.34230 376.7964    22 19.03236 0.05137138
> 3    age60-99 5  43.78333 396.1105 151.14792 577.4880    22 10.34211 0.45789625
>       lambda       fmi
> 1 0.09199575 0.1797674
> 2 0.04886131 0.1352014
> 3 0.31408014 0.4169004